The Explainer Notebook

Structure of this notebook

The notebook consists of six parts: Motivation, Downloading data, Data cleaning & Preprocessing, Basic Statistics, Tools, theory and analysis, and Discussion and Contributions.

The first part, Motivation, describes the goal of the project and the dataset used. The dataset is downloaded in the Downloading data section and then cleaned and preprocessed.

In the Data cleaning & Preprocessing part, additional information about each artist is extracted by applying regular expressions to the downloaded .txt files. This additional information contributes to a deeper understanding of the data, which is covered in the Basic Statistics section. The Basic Statistics section aims to build an understanding of the dataset through exploratory statistics, providing an initial picture of the data and ideas for what to investigate further.

The Tools, theory and analysis part contains two main parts: Network Analysis and Text Analysis. Each of these is divided into further sub-analyses. At the beginning of each sub-analysis, a short presentation of theory and purpose is given. Throughout this section, the tools are explained as they are applied. At the end of each sub-analysis, the outcome is commented on.

The Discussion covers the main results, data quality, and further work.

At the end of the explainer notebook, the Contributions section describes which parts of the project each member has been responsible for.

Part 1: Motivation

1.1 Motivation and goal


While every country has its own language, there is only one language we can all understand: the language of music.


Music is an important part of everyday life. It is almost inevitable to come across music on a regular day, whether it is music on the radio, on TV, at a party, on the bus, or simply background noise from different sources. Music undeniably affects our lives. And it is important:

It brings us together  
It can improve health and wellbeing  
It can improve confidence and resilience  
It is a creative outlet  
and last but not least: music is fun.  

(At least according to this article.)

Music all comes down to the artists: if there are no artists, there are no songs to listen to. How do the artists link to one another, and what kind of information can be derived from the different groupings? We know of many icons that have dominated the charts for a long time. How are these represented in the data?

Music affects us in some way or another. But which words are we actually hearing when we put on our headphones? What kind of messages are we exposing ourselves to? Are they generally negative or positive? What is the mood of the songs that we most often choose to listen to? We know of countless love songs. Are these represented in the data? Or do other topics dominate the data?

These are questions that this study seeks to answer. The aim of this study is to investigate different artists and compare different genres to achieve deeper insight into the music industry. Before starting, the scope of the project must be limited: there are quite frankly too many artists across the world for this study to take them all into account with the limited computing power that we possess. Hence, to make it manageable, this study focuses on the following five music genres.

RnB | Pop | HipHop | Rock | Country

1.2 Data

The data for the study is based on the Wikipedia lists of important artists within each genre, which limits the scope of artists in the network. The Wikipedia lists can be found via the links above.

For the network analysis each artist is represented as a node, and an edge between two nodes represents whether one artist's page links to the other's. In this way, the network illustrates which artists are connected to other artists based on their wiki pages. The number of links between two artists is the weight of the edge, which is used later on in the network analysis. The data consists of the following attributes.


For the sentiment analysis and topic modelling, the Genius API is used to extract songs. To represent the genres, the most popular songs are extracted from the most popular artists within each genre. Due to limited computing power, the dataset contains 50 songs/artists from each genre, giving a total of 250 songs.

This study is based on a sample of the whole music industry, and the results provide only an indication of patterns within the industry as a whole.

Libraries
The libraries used for this project are presented and imported below.

Part 2: Downloading data

Data for the network analysis

In the following code, the list of artists is downloaded from the five different Wikipedia pages: RnB, Pop, HipHop, Rock, Country.

The data is downloaded using the Wikipedia API, which is a way of communicating with the web. Regular expressions are used to obtain the name of each artist. The names are then saved in a pandas DataFrame. The result of the following code chunk is five different dataframes (one for each genre) with the names of all artists. The dataframes are saved as five different .csv files.
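The extraction step can be sketched roughly as follows; the wikitext snippet and the regex pattern are illustrative assumptions, not the notebook's exact code:

```python
import re
import pandas as pd

# Hypothetical snippet of wikitext from one of the genre list pages.
sample_wikitext = """
* [[Aaliyah]]
* [[Beyoncé]]
* [[Frank Ocean|Ocean, Frank]]
"""

# Wiki links have the form [[Page name]] or [[Page name|display text]];
# the first capture group is the page (link) name.
pattern = r"\[\[([^\]\|]+)(?:\|[^\]]+)?\]\]"
names = re.findall(pattern, sample_wikitext)

df_genre = pd.DataFrame({"name": names, "genre": "RnB"})
print(df_genre)
```

A dataframe like this would then be written out with `df_genre.to_csv(...)`, once per genre.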

In the following code, the function replaceunicode is applied. This function decodes the Wikipedia text. The function is written under the section Functions.

The names of all artists have now been downloaded. In the following code, the dataframes are combined into a single dataframe.

The following code loops through all artists in the combined dataframe in order to read and save the Wikipedia page of each artist. Some of the names are redirected when inserted into the wiki API. If this happens, the code saves the redirected name and uses that in the API instead. This is necessary in order to obtain the text of all artists. Furthermore, some of the names contain a '/', which is an invalid character in the name of a .txt file. If this happens, the '/' is replaced with '_'.

Warning: Please be aware that the following code takes a really long time to run.
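The two special cases described above (redirects and '/' in names) can be illustrated with a minimal sketch; the response dict and helper names are hypothetical stand-ins, not the actual MediaWiki payload used in the notebook:

```python
# A hypothetical, simplified API response for a title that redirects.
response = {
    "query": {
        "redirects": [{"from": "JAY Z", "to": "Jay-Z"}],
    }
}

def resolved_title(response, requested):
    """Return the title the API actually served, following any redirect."""
    for r in response["query"].get("redirects", []):
        if r["from"] == requested:
            return r["to"]
    return requested

def safe_filename(name):
    # '/' is invalid in a file name, so replace it with '_' before saving.
    return name.replace("/", "_") + ".txt"

print(resolved_title(response, "JAY Z"))   # "Jay-Z"
print(safe_filename("AC/DC"))              # "AC_DC.txt"
```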

Part 3: Data cleaning & Preprocessing

Before proceeding to the Basic Statistics the data needs to be cleaned and preprocessed. This will be done in this section.

Furthermore, the data saved in the file named df_all5.csv contains only very basic information about the artists, namely the name and genre. In this section, the .txt file for each artist is therefore used to extract more useful information about the artists.

To begin, the .csv file containing all artists/bands from all five genres is loaded.

In the dataset there exist duplicates on the subset (genre, link_name). An example of this is the band named 'Blaque': both of the rows shown below redirect to the wiki page named 'Blaque'. Hence one of these rows should be dropped.

In total, there are 31 rows that appear as duplicates on the subset (genre, link_name). In the following code these are dropped.

In the initial dataframe, an artist can have multiple rows if they appear in multiple genres (like Beyoncé). The genre is redefined so that it appears as a single string; if an artist appears in multiple genres, the genres are separated by commas in the string.

In the dataframe, the last 12 rows are wrongly loaded and have to be dropped:

Adding attributes

The wiki pages contain a lot of relevant information about the artists/bands. In the following sections, attributes describing the artists/bands are extracted from their wiki pages using regular expressions.

Adding origin
The origin is added by extracting information about the birth place/origin from the saved wiki pages. The origin is stated in different ways in the .txt files, hence multiple regular expressions/patterns are needed.
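The multi-pattern approach might look roughly like this; the infobox lines and the two patterns below are illustrative assumptions, not the notebook's full pattern list:

```python
import re

# Hypothetical infobox lines as they may appear in the saved .txt files;
# the origin field is written in several different ways.
samples = [
    "| origin = Detroit, Michigan, U.S.",
    "| birth_place = Houston, Texas, U.S.",
    "|origin=London, England",
]

# Try several patterns in order and keep the first match.
patterns = [
    r"\|\s*origin\s*=\s*(.+)",
    r"\|\s*birth_place\s*=\s*(.+)",
]

def extract_origin(line):
    for p in patterns:
        m = re.search(p, line)
        if m:
            return m.group(1).strip()
    return None

origins = [extract_origin(s) for s in samples]
print(origins)
```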

Adding longitude and latitude: lon and lat

Warning: Please be aware that the following code takes a really long time to run.

Adding years active: start_year and end_year
We can also add information about the start and end of each artist's/band's career.
Some artists have breaks in their career, which means that multiple year ranges can appear under 'Years active' (e.g. the Danish band Aqua). In this project, start_year is defined as the first year active and end_year as the last year active. Some artists are still active; for these artists, end_year is set to 'Present'.
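The parsing rule described above might be sketched like this (the sample 'Years active' strings are hypothetical):

```python
import re

def years_active(text):
    """First year mentioned -> start_year; last year or 'Present' -> end_year."""
    years = re.findall(r"\d{4}", text)
    start_year = int(years[0])
    # An open-ended range means the artist is still active.
    end_year = "Present" if "present" in text.lower() else int(years[-1])
    return start_year, end_year

# Hypothetical values; a band with breaks lists several ranges.
for s in ["1989–2001, 2007–2011, 2016–present", "2008–present", "1994–2016"]:
    print(years_active(s))
```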

Adding genius id and number of followers: genius_id and followers_count
The Genius id and number of followers are found using the Genius API.

Warning: Please be aware that the following code takes a really long time to run.

The number of followers is found using the RapidAPI page. Hence, a user has been created in order to use their API.

Warning: Please be aware that the following code takes a really long time to run.

The data is now ready for analysis.

Part 4: Basic statistics

This section will give an introduction to the considered dataset.

Genre distribution:
To begin with, the genre distribution is presented. It is important to look at the distribution in order to conclude whether or not the data is dominated by a single genre. The genre distribution is plotted using the plot_attribute_dist function, which can be seen under the section Functions (make sure to run the function before use).

Before looking at the genre distribution, a flattened dataframe is created. This is because an artist can belong to multiple genres: if the list is not flattened, the genre would take values such as 'RnB, HipHop', but here we only wish to see the five genres overall.
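The flattening step could look roughly like this in pandas (the sample rows are made up for illustration):

```python
import pandas as pd

# Hypothetical slice of the artist dataframe; genre is a comma-separated string.
df = pd.DataFrame({
    "name": ["Beyoncé", "Johnny Cash", "Drake"],
    "genre": ["RnB, Pop", "Country", "HipHop"],
})

# One row per (artist, genre) pair, so each genre is counted on its own.
df_flat = df.assign(genre=df["genre"].str.split(", ")).explode("genre")
print(df_flat["genre"].value_counts())
```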

From the above plot, it can be seen that HipHop and Country are the largest genres in the data, while Pop is the smallest.

Active artists over time:
Active artists are defined as artists that, at a given time, have started their career and not yet ended it. It is useful to look at the number of active artists over time, as it gives a good understanding of the data that we are dealing with. Music has existed for ages, but it keeps developing and new artists keep emerging. Furthermore, Wikipedia is a modern invention, so it is interesting to see which artists are represented on Wikipedia, and how many, over the years.

The function active_count is defined below and computes the number of active artists per year.
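A minimal sketch of such a function, assuming start_year is an integer and end_year is either an integer or the string 'Present' as defined in the preprocessing section (the cut-off year used for 'Present' is an assumption):

```python
def active_count(careers, year, current_year=2022):
    """Number of artists whose career spans the given year."""
    count = 0
    for start, end in careers:
        end = current_year if end == "Present" else end
        if start <= year <= end:
            count += 1
    return count

# Hypothetical (start_year, end_year) pairs.
careers = [(1960, 2003), (1981, "Present"), (2010, "Present")]
print([active_count(careers, y) for y in (1970, 1990, 2015)])  # [1, 2, 2]
```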

Active artists total

Active artists divided into genres

Plotting active artists (total) and actives artists (genres):

In the above plot there is a tab showing the number of active artists over time across all genres. This plot shows a large increase in the number of artists in the years from around 1980 to 2010. In the second tab, the active artists over time are divided into the different genres. Based on this plot, it can be concluded that HipHop is the largest genre, while Pop is the smallest. Furthermore, it can be seen that HipHop has a rather rapid increase compared to all other genres.

Followers
The information about followers can give an indication of which artists are popular and thereby which artists might dominate the network. To begin with, it is interesting to look at the average number of followers in each genre. From the genre distribution, it was clearly HipHop and Country that dominated the data, so it is interesting to see whether these genres also dominate in terms of average number of followers.

Computing the average number of followers for each genre
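This can be sketched with a pandas groupby; the follower counts below are invented for illustration:

```python
import pandas as pd

# Hypothetical flattened artist dataframe with Genius follower counts.
df_flat = pd.DataFrame({
    "genre": ["HipHop", "HipHop", "Country", "Pop"],
    "followers_count": [26000, 4000, 500, 9000],
})

avg_followers = (
    df_flat.groupby("genre")["followers_count"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_followers)
```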

In terms of average number of followers, it is once again HipHop that dominates. One should also notice that Pop, the smallest genre in terms of artists, actually has the second highest average number of followers. Furthermore, Country has the lowest average number of followers, despite being the second largest genre in the data. From the above table, it can also be seen that the average number of followers on Genius is in general quite low compared to other social media platforms. It is interesting to look at the top 5 artists with the most followers within each genre, as this might support this tendency.

From the above plot, it can be seen that the HipHop genre in general has more followers than any of the other genres. Furthermore, the artist with the most followers is Kendrick Lamar with 26,291 followers. Compared with his following on Spotify (30.2M monthly listeners) and Instagram (9.7M followers), this is a very small number. This indicates that people in general do not use Genius to any great extent, and thus the follower counts do not give a complete picture of artist popularity.

Part 5: Tools, theory and analysis

Introduction

The study contains three analyses: network analysis, sentiment analysis, and topic modeling.

5.1 Network analysis

Introduction

In this part of the study, the network is created and analyzed. The network is created such that nodes represent artists and edges represent links between Wikipedia pages. If an artist's page links to another artist's page, these nodes share a directed link; thus the initial network is directed. From the initial network, the Giant Connected Component (GCC) is extracted and analyzed. Due to a high density of connections, a backbone analysis with a disparity filter is carried out. The result of the backbone analysis is a smaller network in which only significant links appear. The backbone network is used for further analysis. Due to the way the links are weighted in the backbone analysis, the network needs to be undirected; therefore, the network is saved as an undirected network, which forms the foundation of the rest of the analysis. In the end, communities are created in an attempt to figure out how well the network actually represents common assumptions about linkages between music artists.

Network creation
To begin with, the network is created as a directed graph in which a node is added for each artist. Attributes are added to the nodes from the dataframe. Hereafter, the column link_name is used to search through the artists' wiki pages and create links to the pages of other artists.

Because we already have a lot of nodes and links, and we do not want artists in the network who are too unknown (no links), we extract the GCC from the network. For visualisation purposes, we also make the network undirected.
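A toy sketch of the GCC extraction with networkx; taking the largest weakly connected component is our assumption here, as the text does not state which variant is used:

```python
import networkx as nx

# Toy sketch: a directed link is added when one artist's page links to another's.
G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D"])      # D has no links at all
G.add_edges_from([("A", "B"), ("B", "A"), ("B", "C")])

# The GCC is taken as the largest (weakly) connected component,
# then the graph is made undirected for visualisation.
gcc_nodes = max(nx.weakly_connected_components(G), key=len)
GCC = G.subgraph(gcc_nodes).copy()
UG = GCC.to_undirected()
print(sorted(UG.nodes()), UG.number_of_edges())
```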

The density of connections

Later on, the so-called backbone of the network will be extracted. This is partly because of the large number of links in the network, which can be seen from the average degree. The graph has a high density of connections, ⟨k⟩ = 19, making both its analysis and visualization difficult.

Degree distribution

Next, we look into the degree distribution of the network. In the figure below, it is shown for each of the genres. If you want to compare the genres separately, click their respective names in the legend on the left. Also, keep in mind the total number of artists in each genre shown in the Basic Statistics section when looking at the scale differences here.

The next code cells first prepare the data for plotting, and then visualise the degree distribution of all genres in the same plot.

The degree distributions of the genres seem most reminiscent of a power-law distribution, but the genres HipHop and Rock could also be fitted with a Poisson distribution. It is also noticeable that there are clearly a few artists in each genre with far more links than the rest.

Next, we plot the difference between in-degree and out-degree in the directed GCC:

Top 15 most connected artists
In the following the 15 most connected artists are presented. These are found by computing the degree of each artist.

From the above table, it can be seen that all 15 of the most connected artists are well-known artists. Furthermore, 9 out of the 15 belong to the HipHop genre, which is consistent with HipHop being the largest genre in the network.
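The top list described above is essentially a degree sort, sketched here on a toy graph:

```python
import networkx as nx

# Toy undirected graph standing in for the GCC.
G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

# Sort artists by degree and keep the head of the list (top 15 in the notebook).
top = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)[:15]
print(top)
```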

Drawing the network
The network is drawn in the code sections below. The function Force Atlas under the section Functions helps determine the positions of the nodes.

The network is now plotted using the NetworkX draw function and node coordinates from the Force Atlas algorithm.

Due to the large number of nodes and links, it is close to impossible to identify specific connections in the network. However, it is remarkable that there is a clear separation between the different genres. The top left part of the network is dominated by HipHop artists. The red and blue nodes, representing Rock and Country artists respectively, are equally isolated in different parts of the network graph. The RnB artists are located very close to the HipHop artists, which makes sense given the common conception of the similarities between these genres. It has also been discovered that most of the yellow nodes (the artists contained in multiple genres) are in exactly the HipHop and RnB categories. When it comes to the Pop artists, their position in the middle of the network graph also conceptually matches the fact that many Pop artists use elements from different genres to be "popular" for everyone.

Summary:
Even though there are clearly more nodes with few connections than with many, the average degree of the network is above 19. This, together with the fact that the network is visually very difficult to inspect, has led to a backbone analysis of the network.

Backbone Analysis

The backbone is extracted according to the technique presented in this article from 2009 by M. A. Serrano et al. The purpose of the method is to find a significance level α for which a sufficient amount of link weight is kept in the network, while the total number of edges is greatly reduced. In order to select a suitable significance level α, we look at the figure below. On the left, we see $N_B/N_T$ as a function of $W_B/W_T$ at different significance levels, where $N$ is the number of nodes, $W$ is the weight, and the subscripts $B$ and $T$ refer to the backbone and total network respectively. On the right, we see $N_B/N_T$ against the number of edges $E$ kept at different significance levels.

In order to conduct the backbone analysis, the links of the network should be weighted in a way that represents the importance of the individual links. For this analysis, the weight of a link is chosen to be equal to the number of times it is included on the wiki pages, in both directions. This means that if node i mentions node j two times, and node j mentions node i once, then the resulting weight of the link is 3. It follows that the network must be undirected for this weighting to be possible.
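The weighting rule can be sketched on a toy directed graph; edge weights here stand in for mention counts:

```python
import networkx as nx

# Toy directed graph where edge weights count mentions on the wiki pages.
D = nx.DiGraph()
D.add_edge("i", "j", weight=2)   # i mentions j twice
D.add_edge("j", "i", weight=1)   # j mentions i once

# Combine both directions: w(i,j) = mentions(i->j) + mentions(j->i).
U = nx.Graph()
for u, v, data in D.edges(data=True):
    w = data["weight"] + (D[v][u]["weight"] if D.has_edge(v, u) else 0)
    U.add_edge(u, v, weight=w)

print(U["i"]["j"]["weight"])  # 3
```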

The function disparity_filter written in the section Functions is from this page: github, backbone.
The function is used to find the backbone of the network. Different values of $\alpha$ are tested in the following section in order to find the optimal value.

From the output above, it is clearly seen that the number of nodes and edges decreases as the value of $\alpha$ is decreased. By decreasing the value of $\alpha$, only the more significant links are kept.
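For reference, the significance criterion of the disparity filter can be sketched directly from the formula in Serrano et al. (2009); the weights below are invented for illustration:

```python
# For node i with degree k_i and strength s_i (sum of its edge weights),
# an edge is kept if (1 - w_ij / s_i)^(k_i - 1) < alpha.
def edge_significance(w_ij, strength, degree):
    return (1 - w_ij / strength) ** (degree - 1)

# Node with degree 4 and strength 10: one heavy edge and three light ones.
weights = [7, 1, 1, 1]
s, k = sum(weights), len(weights)
p_values = [round(edge_significance(w, s, k), 3) for w in weights]
print(p_values)  # only the heavy edge has a small p-value

alpha = 0.1
kept = [w for w, p in zip(weights, p_values) if p < alpha]
print(kept)
```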

Below, the fraction of nodes kept in the backbones is plotted as a function of, respectively, the weight and the number of edges retained by the different $\alpha$-filters.

Below is a table showing the sizes of the disparity backbones in terms of the percentage of total weight, nodes and edges for different values of the significance level $\alpha$.

| $\alpha$ | $\%W_T$ | $\%N_T$ | $\%E_T$ |
|----------|---------|---------|---------|
| 0.5      | 73.0    | 93.0    | 38.0    |
| 0.4      | 66.6    | 85.6    | 27.5    |
| 0.3      | 60.2    | 79.5    | 20.8    |
| 0.2      | 51.6    | 73.6    | 15.0    |
| 0.1      | 40.5    | 62.1    | 9.0     |
| 0.05     | 33.0    | 50.6    | 6.2     |
| 0.01     | 19.4    | 31.7    | 2.8     |

The choice of significance level is admittedly somewhat subjective. We argue in this case that, because the weight kept is more or less directly proportional to the value of α, we can choose by looking mostly at the right-hand graph of the edges kept. There it is clear that, down to α = 0.1, we remove more edges than at higher α values; therefore, this is the chosen significance level.

Analysis of backbone network

This section inspects the backbone network further. This is done first by looking at the distribution of artists in the different genres again. Then we compare the new network to the old one by looking at the most connected nodes/artists and the degree distribution.

First, the network is built again with attributes:

Below, the distribution of artists in the different genres is shown again. Clearly, the difference in the number of nodes per genre has not changed much since before extracting the backbone. This is good, as it indicates we keep most of the node information without all the links.

Top connected artists/bands

To compare the backbone to the old network, we also look at the 15 most connected artists again. Most of them are the same as before the backbone was extracted, but there are also some differences. E.g. Michael Jackson is no longer in the top 15 most linked artists; instead there are some new names such as Missy Elliot and Mariah Carey.

Degree distribution

Again, to compare with the old network, the degree distribution is plotted for the backbone network. As expected, the degrees are generally much lower, and the distribution seems to follow a power law very well.

Drawing the backbone network

Again we draw the network to illustrate the difference from before to after the backbone analysis.

When the backbone is extracted, the network looks as shown above. The number of nodes has been reduced to 2955 and the number of links to 4701. This means that we keep more than 60% of the nodes while retaining less than 10% of the links, which is exactly the goal of this section. In the network visualisation above, it is also evident that we still have a good separation between the genres, even with far fewer links.

Communities

Now that the network only contains what we find to be the "important" links, the communities are found. This is done with the Louvain algorithm, which is an approximation algorithm that tries to maximize the modularity score $Q$ (Guillaume et al., 2008).
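On a toy graph, the Louvain step might look like this with networkx's built-in implementation (available in networkx ≥ 2.8; the notebook's exact implementation may differ):

```python
import networkx as nx

# Toy graph: two tight groups joined by a single link.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),     # group 1
                  ("X", "Y"), ("Y", "Z"), ("X", "Z"),     # group 2
                  ("C", "X")])                            # bridge

# Louvain approximately maximises the modularity Q of the partition.
communities = nx.community.louvain_communities(G, seed=42)
Q = nx.community.modularity(G, communities)
print(len(communities), round(Q, 2))
```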

When implemented on the backbone network, we end up with 38 communities and a modularity score of $Q = 0.76$, which indicates a fairly good partition (Barabási, 2015).

Next, the communities are renamed after the 3 nodes within each of them with the highest degree. This will make interpretation of the communities much easier.

Community sizes

As mentioned, a goal of the network analysis is to study whether or not the communities match common conceptions about the artists' relations. Therefore, the number of artists within each genre in each community is also shown below. Clearly, a lot of the communities are dominated by a single genre, but there are also exceptions, such as the communities Nicki Minaj, Chris Brown, Beyoncé and Mariah Carey, Whitney Houston, Michael Jackson. Remember again that there are not equally many artists in each genre in the whole network, and therefore these two communities can be seen as quite "diverse".

Interactive plot of selected communities

To look closer at some of the largest communities, an interactive visualisation can be seen below. Here the two Country communities Carrie Underwood, Reba McEntire, Taylor Swift and Johnny Cash, Willie Nelson, Elvis Presley are shown together with the Rock community Nirvana (band), Radiohead, Alice in chains, and the HipHop communities Eminem, Dr. Dre, Snoop Dogg and Kanye West, Drake (musician), Rihanna. Furthermore, the two mixed communities Mariah Carey, Whitney Houston, Michael Jackson and Nicki Minaj, Chris Brown, Beyoncé are included.

The output from the above code can be viewed on the website; it is the plot below the bar plot showing the community sizes. You can zoom in on the different communities to look at the individual artists in each of them. If you are into music, you will hopefully agree that most of the artists are in a community that seems sensible. There are of course also examples of nodes/artists that might not seem to belong in their community; e.g. DJ Tiësto and Swedish House Mafia might not be a natural inclusion with Nirvana and the other rock bands. Another thing to notice is how the two mixed communities also have more yellow nodes. The fact that these multi-genre artists are represented more in these communities could indicate that they can also act as links between artists in different genres.

Community partition analysis

We have now seen how the genres generally offer a good explanation for the community partition. However, one thing that is still left unanswered is what causes there to be multiple communities dominated by the same genre rather than just a single one. In the next section, we seek to explain some of this by looking into time and space attributes of the communities. The graph below shows when the artists from a selection of communities started their careers.

If one selects the first two communities in the legend to the right of the graph, one will see that there is indeed some temporal difference between these two Country communities. Most of the artists in Johnny Cash, Willie Nelson, Elvis Presley started their careers around the 1960s, whereas the artists in Carrie Underwood, Reba McEntire, Taylor Swift mostly started their careers after 1980. Exactly the same can be seen if you instead select the next three communities, namely the HipHop communities. Here we see that most of the artists in Eminem, Dr. Dre, Snoop Dogg started their careers between 1985 and 1995, in Kanye West, Drake (musician), Rihanna they started between 2004 and 2014, and in the community of new rappers XXXTentacion, Lil Durk, Juice Wrld almost all started after 2010. Again, the same can be said for the last two communities, earlier referred to as the mixed communities.

The next explanation for the separation of communities is related to where in the world the artists come from.

The map above shows two of the biggest communities dominated by Rock artists. A node's position on the map corresponds to the artist's origin. It can be seen how one community originates almost exclusively from North America, while the other community contains a lot of artists from the UK.

Plotting of backbone by community

In the network below, the backbone is shown again, now with each node colored according to its community.

Network analysis outcomes

In this section, the linkage of music artists within and between different genres has been explored. As a result, several conclusions can be drawn about the network. Firstly, already before the backbone analysis, the genres seem to be a good partitioner for the network, as could be seen in the first network plot on this page. However, the network was so densely connected that further analysis could become difficult, and therefore the backbone of the network was extracted. With the disparity filter method, it was possible to remove more than 90% of the links while still keeping most of the important information. The backbone network not only made the network more manageable, it also made it possible to divide the network into communities of sensible sizes. The goal was to evaluate the communities on the basis of how well they represent common assumptions about linkages between music artists. This leads to a somewhat subjective judgement of the communities, but all in all it can be concluded that the communities to a large extent represent the linkages one would have assumed. To some extent, this can be explained by temporal and spatial attributes, and of course the genres of the artists.

5.2 Data for the sentiment analysis and topic modeling

The file "GeniusAPIartists.csv" contains the artist information from the network analysis, including the degree of each artist, which will be used later on. The artists that are not on Genius have been removed for the API to work.

The length of the dataframe is computed to see how many artists are included in df. The number is lower than for the network analysis, because some artists are not on Genius.

Each song should represent exactly one genre for the sentiment analysis and topic modeling to work properly; otherwise, some songs would appear twice. Each artist will therefore only represent one genre.

The size of the dataframe has been reduced.

The top 50 artists within each genre are extracted, based on each artist's degree from the network analysis.

The Genius API is used to extract each artist's most listened-to song. This song will represent the artist.

A dataframe for all songs is created.

The song dataframe is combined with the top50 dataframe so that each song is assigned to a genre for later analysis.

5.3 Sentiment Analysis

Sentiment analysis goal

The purpose of the sentiment analysis in our study is to gain insight into the emotional side of the songs. We start by looking at the overall picture and investigate the distribution of sentiment, revealing whether the most listened-to songs carry a positive or negative message. To get deeper insight, we investigate the sentiment for each genre to see if some genres are more violent/negative than others. Furthermore, the study includes an evaluation of the model's performance in classifying the sentiment of the songs. Finally, we look into the most positive and most negative songs and the words that characterise the sentiment groups, to understand what makes a song negative or positive. Hopefully this part of our study will provide knowledge about the emotional state of songs and how the genres differ in their sentiment, so that you in the future can choose your playlist depending on your mood.

Sentiment tool

A sentiment analysis model is a machine learning technique that can classify a text as positive, neutral or negative. Analysing language is called Natural Language Processing (NLP) and covers both sentiment analysis and topic modeling, both of which we use in this study. For the sentiment analysis, the goal is that the model should understand all kinds of underlying tones in a text. This includes negation, capitalization, linguistic context, symbols, and sarcasm, and the model should take these into account when classifying the sentiment. An example of a negation that the sentiment model should be able to capture can be seen in the sentence "I do not like you". The word "like" is normally classified as positive, but the words in front of it, namely "I do not", negate it, making the sentence negative.

There exist many sentiment analysis models, each for a different purpose and with their own strengths and weaknesses. Due to the often extreme wording used in songs, the chosen model is Vader (Valence Aware Dictionary for Sentiment Reasoning). Vader is trained on social media data and is therefore capable of understanding intensity ("!!!"), acronyms ("LOL"), and slang. The model classifies based on the full sentence structure, since it relies on word order. The Vader sentiment model calculates a sentiment score (compound score) on the scale [−1; 1] and then applies thresholds to classify the text as positive, neutral or negative. See the Vader sentiment model GitHub page for a detailed description of the model. For this study, we use the Vader sentiment classification, which is as follows:

positive: sentiment score >= 0.5  
neutral: -0.5 < sentiment score < 0.5  
negative: sentiment score <= -0.5  
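These thresholds can be written as a small helper; the function name is our own, not from the Vader library:

```python
# Map a Vader compound score to the sentiment classes defined above.
def classify(compound):
    if compound >= 0.5:
        return "positive"
    if compound <= -0.5:
        return "negative"
    return "neutral"

print([classify(c) for c in (0.82, -0.71, 0.12)])  # ['positive', 'negative', 'neutral']
```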

The following code performs the sentiment analysis.

The column types are changed from object to float.

Each song is classified as positive, neutral or negative based on its compound score.

Sentiment Distribution

It is interesting to investigate the overall sentiment of the music industry to obtain a picture of whether people prefer to listen to happy or sad songs. The Vader sentiment analysis has been performed on the data, and the distribution of the classes can be seen below.

The distribution is plotted to visualize the sentiment classes.

Very few songs are classified as neutral, which indicates that music tends to have an extremely polarised sentiment, whether very loving or extremely angry. This suggests that music tries to evoke emotion in the listener. Furthermore, the overall picture indicates that most songs are positive.

It should be noted that the dataset does not cover the whole music industry, so the results only give an indication of what the industry looks like.

Sentiment per genre

To achieve a deeper understanding of the sentiment distribution we have chosen to look at one genre at a time. We can then compare the genres and their sentiment to understand how they differ.

We start by making sure that we have 50 songs per genre.

The following code prepares the data for the percentile plot below. Since there are 50 songs for each genre, multiplying each count by 2 converts it into a percentage.
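The count-to-percentage conversion can be sketched like this (a hedged sketch; the helper name is ours, and the class labels are assumed to match the sentiment analysis above):

```python
from collections import Counter

# Sketch: convert sentiment-class counts for one genre's 50 songs into
# percentages by multiplying each count by 2 (since 100 / 50 = 2).
def genre_percentages(labels):
    counts = Counter(labels)
    return {cls: counts.get(cls, 0) * 2
            for cls in ("positive", "neutral", "negative")}
```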

The resulting classification of songs within each genre can be seen below.

It is clear that some genres tend to have more negative songs while others have more positive ones. As expected, hip hop (followed by rock) is the genre with the largest share of negative songs, whereas RnB and pop are the genres with the most positive messages. This clearly indicates that the tone we are exposed to depends on the genre we choose to listen to.

Sentiment evaluation

To evaluate the model performance we look into the most positive and most negative songs. By reading the lyrics we can evaluate whether these songs have been classified correctly and thereby get an indication of the model's overall performance.

We have googled the lyrics for these six songs and found that they are all classified correctly.

Girls Like You: This song is about a boy being madly in love with a girl. As the lyrics say, spending 24 hours together is not even enough for him. There is no doubt that this song has been correctly classified as extremely positive.

Mirrors: The song was inspired by Justin's grandparents' relationship of 63 years. It is about finding true love and your other half. As he sings, "You reflect me, I love that about you. And if I could, I would look at us all the time". The lyrics include many positive sentences about love and desire, and the song is therefore classified correctly.

Thank u, next: At first glance the title "Thank u, next" seems ironic rather than positive, but a closer look at the lyrics reveals why the sentiment analysis classified this song as extremely positive. The song is about learning to love yourself, as she sings "I've got so much love... I turned out amazing", which shows the classification is correct after all.

HUMBLE.: This song is about the conflicting feelings of success. Throughout the song the lyrics encourage powerful people to be humble and more down to earth. He sings, "I'm so fuckin' sick and tired of the photoshop. Show me somethin' natural like ass with some stretch marks". The song disses the upper-class lifestyle and is thereby correctly classified as negative.

Cruel Summer: The song deals with the darker side of the traditional summer song. It is about the sadness of being all alone while everyone else is on vacation: "It's a cruel, cruel summer. Leaving me here on my own". This indicates a correct classification.

No Role Modelz: As the title implies, this song is about J. Cole's lack of role models growing up. Furthermore, he attacks Hollywood itself for its focus on superficial people, especially when it comes to women. The lyrics carry almost exclusively negative messages, and the song is thereby correctly classified as extremely negative.

The same evaluation has been performed on a random sample. The random songs can be seen below.

We have googled the lyrics and evaluated how well the songs have been classified.

You Make It Easy: It is a love song to a partner and thereby very positive.

Two Nights Part II: This song is about someone cheating in a relationship. It deals with hurt and heartache and therefore contains many negative sentences.

Still Into You: This is a song about a love that goes on and on, with good memories of each other even though time has passed. It is correctly classified as positive.

To summarize the evaluation, it can be stated that the Vader Sentiment Model is very accurate when it comes to classifying songs' sentiment.

Sentiment conclusion

This study has extracted meaningful insights into the sentiment and tone of the songs we listen to. The results clearly show that some genres tend to have more negative messages, which we can keep in mind when choosing a playlist. Furthermore, the analysis showed that most of the most-listened-to songs actually provide us with happy and positive feelings.

For further investigation of the sentiment within the music industry we recommend a larger dataset with additional genres to provide a more versatile perspective. In addition, it would be interesting to choose another sentiment analysis tool to compare the models and their results, or even to train a new model on songs to achieve an even better sentiment model for this research area.

5.4 Topic Modeling

Topic Modeling goal

The topic modelling part of this study aims to investigate what lyrics we actually listen to and fill our minds with when we play our favourite songs. We want to research whether some topics are more popular than others and whether the topics depend on the genre. Before implementing the topic model we look at the most frequent words to see if some patterns can be discovered. The next step is text preprocessing, where we clean the data, reduce the vocabulary size and thereby prepare the data for the model. The topic modelling is then applied to the cleaned data to reveal topics within the songs. To get an even deeper insight into the topics our model detected, we compare the topics' sizes, their sentiment and their distribution of genres. The goal is to understand the words we listen to and to discover whether the topics and messages in the lyrics depend on our choice of genre.

Topic Modeling tool

The study will use Latent Dirichlet Allocation (LDA) to perform topic modeling on the data. The LDA model consists of two main procedures: generating topics and assigning topics to songs. Each song is then described by a distribution of topics, and each topic is described by a distribution of words. For a topic model to perform well, the input data should only contain meaningful words, since unnecessary words add too much noise for the model to detect patterns. It is therefore very important to clean the data and reduce the corpus size as much as possible before implementing the model.

When talking about a collection of written text, the correct term is corpus. Our corpus is the song lyrics in our dataset. The vocabulary size is the number of unique words within the corpus.

Frequency count before cleaning

As a preliminary investigation we look at the most frequent words within the corpus, to see if any topics are revealed.
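Such a frequency count can be sketched with a plain `Counter` (a simple illustration using naive whitespace tokenization; the actual notebook code may tokenize differently):

```python
from collections import Counter

# Sketch: count word frequencies across all lyrics and return the top n.
def top_words(lyrics_corpus, n=10):
    words = " ".join(lyrics_corpus).lower().split()
    return Counter(words).most_common(n)

songs = ["The night is dark", "the love the love"]
print(top_words(songs, 2))  # → [('the', 3), ('love', 2)]
```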

Top 10 most common words in the dataset:

It is clear that the most frequent words are stopwords without meaning. A thorough text cleaning is needed in order to ensure a qualified topic modeling performance.

Text preprocessing

The purpose of text preprocessing is to reduce the vocabulary size by cleaning the corpus. Since the quality of the topics found by the topic model is highly dependent on the quality of the input data, this text cleaning step is very important. The corpus will undergo four different cleaning steps in order to produce the input for the topic model.

Noise removal

The first step in the cleaning process is noise removal. It begins by transforming all words into lowercase, then removing all stopwords such as "the" and "a", followed by removal of all non-alphabetic characters (!?,.). The last part of the noise removal is to lemmatize the words, i.e. to reduce each word to its dictionary form.

Remove rare words and single-character words

A very simple but efficient text cleaning step in terms of vocabulary reduction is to remove rare and short words. It is safe to say that words appearing only once in the entire corpus are without importance and can be deleted. Words of only one character are deemed to have no importance for the topics as well.

Nouns, verbs and adjectives

It has been evaluated that nouns, verbs and adjectives carry most of the information about the message of the songs. All words not labeled "noun", "verb" or "adj" are therefore removed.

Meaningless words

As the last cleaning step we looked at the most frequent words and removed words with no particular meaning for the overall messages. This will help the model perform.
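The noise-removal and rare-word steps above can be sketched in plain Python (a simplified sketch: the stopword list is an illustrative subset, and the lemmatization and POS-filtering steps, which require an NLP library such as nltk, are omitted here):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "i", "you", "is", "do", "not"}  # illustrative subset

def clean_corpus(docs):
    """Lowercase, keep only alphabetic tokens, drop stopwords and
    single-character words, then drop words occurring only once in
    the whole corpus."""
    tokenized = []
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        tokenized.append([w for w in words
                          if w not in STOPWORDS and len(w) > 1])
    freq = Counter(w for words in tokenized for w in words)
    return [[w for w in words if freq[w] > 1] for words in tokenized]
```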

Frequency count after cleaning

It is of interest to examine the most frequent terms after the cleaning, to detect whether possible topics appear now.

Top 10 most common words after cleaning:

As hoped, the top frequent words for the cleaned data reveal some possible topics: romance, wealth and hate.

Vocabulary size

Below, an overview of the full cleaning process can be seen, together with the resulting vocabulary size after applying the different cleaning techniques.

The vocabulary size of the raw corpus is 12,647 unique words. This is reduced to 2,724 in the last preprocessing step which yields a total word reduction of 78%.
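The reported reduction follows directly from the two vocabulary sizes:

```python
# Vocabulary sizes taken from the text above.
raw_vocab, cleaned_vocab = 12_647, 2_724
reduction = 1 - cleaned_vocab / raw_vocab
print(f"vocabulary reduced by {reduction:.0%}")  # → vocabulary reduced by 78%
```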

The text cleaning of the corpus is now finalized and unimportant words have been removed. The cleaned data is now ready to be used as input for the topic model.

Topic Modelling

The topic model is implemented to generate topics and label each song with the most appropriate topic. The model determines the optimal number of topics to represent the lyrics corpus based on a coherence measure.

It should be noted that the topic model uses random variables and will therefore generate slightly different topics each time it is run. To avoid having different outputs in this notebook than our homepage presents, we have inserted a picture of the output. If you are interested in the code, feel free to play around with the variables for the LDA model ("passes", "iterations", "chunksize", "random_state", "eval_every"). Just be aware that you may get other outputs and other topics.

From the coherence measure the optimal number of topics is n = 4.

The table above shows that topic 3 is the most negative topic (lowest sentiment score) and topic 4 the most positive (highest sentiment score). This aligns with the results from the sentiment analysis, which showed mostly positive songs within the music industry.

Model evaluation

To evaluate the topic model performance we have used the interactive LDA visualizer below. With this tool we have examined how the model generated the topics. The chart on the left illustrates how separated the classes are, whereas the chart on the right shows the words that represent the topics, ordered by relevance. The right chart depends on the tunable λ-parameter, which is based on TF-IDF: a high value of λ assigns importance by term frequency, while decreasing the λ-value gives less importance to frequently used words. You can click on the LDA visualizer and see the words for each topic, and change the λ-value as you please to discover the words representing each topic. The LDA visualizer can be seen on our homepage or displayed here by running the code; to keep the notebook from growing too large we have not printed it here.

From the interactive LDA visualizer we discover that the four topics are nicely separated, indicating that the cleaning process went well and that the model has created good topics to describe the data.

Another way to evaluate the model performance is to look into some lyrics and evaluate whether they have been classified correctly. We do this by taking a random sample of three songs and determining whether they match the topic class they were assigned to.

Too Good At Goodbyes by Sam Smith - Topic 2: It is a breakup song about getting used to being dumped and how previous experiences in relationships make him protect himself. This falls perfectly into the topic "Ups and downs in romance", and the song is thereby correctly classified.

That's What I Like by Bruno Mars - Topic 4: This song is about the rich lifestyle with glamour, and it is therefore fair to say that it fits the topic "Life experience".

Only by Nicki Minaj - Topic 3: It is an extremely sexually explicit song in which Minaj defends her sex life. It is therefore no surprise that the model classifies it as "NSFW" (Not Suitable For Work, in the form of very sexual songs).

Topic Modeling per genre

We will now perform the topic modeling analysis on one genre at a time to discover possible topics within each genre. We start by calculating the coherence for each genre to find the optimal number of topics per genre. The code below calculates the coherence and generates subtopics within each genre. These topics can be seen below.

Looking at the output for the topics within each genre, we discover that the topic model has a hard time detecting meaningful topics. This may be because we only have 50 songs per genre, which is too few for the model to perform topic modeling.

Term frequency

We want to look at the term frequency (TF) and TF-IDF per genre. TF is the number of times a term/word $t$ appears in a document $d$ divided by the total number of words in the document; every document thereby has its own term frequencies. With TF-IDF it is possible to score the relative importance of words (Maklin, 2019).
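The definitions above can be written out in plain Python (a sketch under assumptions: documents are given as token lists, and we use the common idf = log(N / df) variant, which matches the description but may differ from the notebook's actual implementation):

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` in `doc` / length of `doc`."""
    return doc.count(term) / len(doc)

def tf_idf(term, doc, corpus):
    """TF-IDF with idf = log(N / number of documents containing `term`)."""
    df = sum(1 for d in corpus if term in d)
    return tf(term, doc) * math.log(len(corpus) / df)
```

Note that a term appearing in every document gets idf = log(1) = 0, so ubiquitous words score zero regardless of how often they occur.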

The TF makes sense for each genre.

We will now calculate the TF-IDF.

We discover that the TF-IDF gives far too specific words, and the output is not suitable. The topic model does use TF-IDF to calculate the keywords for each topic, but there we can adjust the lambda value. As can be seen in the code where we implemented the topic model, we used a high lambda value, which assigns more importance to TF.

Topic Modeling conclusion

This study provides an understanding of what the words we listen to are all about. The topic model detected different topics within the songs, and we examined which topics have the highest/lowest sentiment score, indicating the emotional influence of each topic. Furthermore, the results show which topic has the highest representation and in which genres the topics occur most. In the end there was no doubt that the largest topic is "Life experience", which makes sense since it is a very broad topic covering many kinds of songs. Furthermore, as expected, HipHop accounted for almost half of the songs in the topic "NSFW" (Not Suitable For Work), which was the most negative topic. The analysis therefore supports many people's assumption that HipHop is an aggressive genre.

With this topic part of the study you are provided with a tool to understand what kind of music to choose depending on the topic you want to listen to and what kind of emotional tone that topic has.

Wordclouds

An investigation of different wordclouds has been carried out: wordclouds for topic classes, wordclouds for genres and wordclouds for sentiment groups.

From the wordclouds of the topics it is possible to see that "Violent gangster" and "NSFW" have wordclouds that are alike. This makes sense, since they both contain many HipHop songs.

The same goes for the two other topics, "Ups and downs in romance" and "Life experience". These two topics have wordclouds that contain many of the same words.

From the wordclouds it is clear that Pop and RnB are very much alike, with words like "love" and "baby" having significant importance. Our network analysis showed the same: many Pop artists also make RnB music and vice versa, so these two genres are closely connected. The words representing HipHop are words that also described the topics "Violent gangster" and "NSFW". This is as expected, since HipHop accounted for the largest proportion of each of these topics (26% and 48%, respectively). For the last two genres, Rock and Country, the words are more evenly distributed between the topics, which matches their distributions in the topic modeling analysis. Furthermore, Country has very few negatively charged words, which fits the result from the sentiment analysis that Country is the genre with the lowest proportion of negative messages.

It makes perfect sense that words like "love", "baby", "heart" and "thank" are part of the positive wordcloud. These are words that we normally associate with a positive message.

As expected, the wordcloud for the neutral sentiment class contains words pointing in opposite emotional directions and quite a lot of words with no sentiment orientation.

Finally, the wordcloud for the negative sentiment contains words like "fuck", "bitch" and "shit", which clearly indicate a negative message. Furthermore, the wordcloud contains the word "nigga", which is not in itself a negative word but one many HipHop artists use in their songs. HipHop songs often have a negative message, causing their lyrics, and thereby also the word "nigga", to appear in the negative wordcloud.

Part 6: Discussion and Conclusion

The goal of this study was to investigate different artists and compare different genres to achieve a deeper insight into the music industry. The study began with a thorough data collection from different Wikipedia pages, followed by data cleaning and preprocessing. In the basic statistics section it quickly became clear that Hip Hop (and Country) was the most represented genre in the data. The network was built and a backbone analysis was carried out. The backbone used for further analysis only reinforced the dominance of the Hip Hop genre, as its proportion of the total network increased. For future work one could aim for a more equal genre distribution, which might have revealed new patterns in the data and possibly been reflected in the partitioning of communities. However, the network analysis showed reasonable results despite the unequal genre distribution.

Prior to the backbone analysis the Giant Connected Component (GCC) was plotted. Already in this drawing of the network it was clear how the genres were separated, as the different genres tended to cluster together. However, it was decided to perform a backbone analysis, as the network contained many insignificant edges and nodes, which would only make the analysis and inspection of the network more difficult and blurred. A weighted network was needed in order to perform the backbone analysis. The weighting was based on the number of links between two artists' Wikipedia pages. However, the weighting could have been chosen in several other ways, e.g. by including the number of followers. By relying on the Wikipedia pages, the network tends to favor the more popular artists, since they often have more text on their pages. For future work it would be interesting to define the links by collaborations, though this also has its downsides, as it might favor some artists in other ways.

The partitioning into communities was based on the GCC of the backbone network. The partitioning seemed reasonable in terms of the number of communities and the modularity score. Looking further into the communities only reinforced the reasoning behind the networks. 38 communities were found, and at first glance it looked somewhat strange that several of them were highly dominated by e.g. the Country genre; prior to the analysis one could think there would be just one big community per genre. However, when looking at attributes such as starting year and origin it made more sense. The Louvain algorithm succeeded in partitioning different groups of artists within the same genre, such that older artists were in one community while newer artists were in another. Nonetheless, other methods could have been used for community detection, which might have revealed other insights. Hence, for further work it could be interesting to combine the Louvain algorithm with other methods.

Next up was the sentiment analysis. Overall the model performed very well, and by using Vader it succeeded in understanding intensity, acronyms and slang within the lyrics. However, the sentiment analysis showed a very low proportion of neutral songs. This might be due to the often extreme emotions in songs, which artists use to really get their message out. On the other hand, it might also be due to the way Vader is trained: Vader is trained on social media data, which of course affects its performance on other kinds of text. This gives rise to opportunities for training models on music lyrics instead, which might increase the performance when tested on the songs in the dataset.

From the topic modeling analysis it was possible to detect topics and classify the songs by these topics. Many people have assumptions about different genres and what they are about, and the topic analysis supported some of these assumptions, indicating clear patterns in the music industry. For a more detailed topic analysis, the model would improve if trained on a larger dataset. A further investigation into the topics could also be carried out with additional text cleaning, for instance replacing synonyms.

Part 7: Contributions

Name Study ID Contribution
Cecilie Kosack s184304 Network Analysis, Basic Stats and project video
Amanda Sommer s184303 Sentiment Analysis, Topic Modelling and homepage
Julius Rasmussen s184288 Network Analysis, Basic Stats and homepage

References

Functions